The idea for this article came from a documentary about the life of the famous con artist Wolfgang Beltracchi (link). I often heard claims such as "Investing in fine art is a great investment!" in the media. Also, many private banks offer client services such as art consulting, which advises on buying and selling fine art.
This analysis is an attempt to show what returns an art investor might generate, using actual transaction data. But how? Structured statistics with actual transactions, trends, or patterns do not exist.
To find useful data, programmers are equipped with handy tools - especially web scraping techniques.
The primary source of information is the website of the Slovak National Gallery's auction house: https://www.soga.sk (aka SOGA). The website is filled with data that can be categorized as:
The content of the website will provide the fundamental dataset for this analysis.
First, let’s gather the data and build a database containing all the data available from the website.
But before we even start the web scraping process, let's check whether the website allows this behavior. More information about using robots.txt here
library(tidyverse) # dplyr, purrr, stringr & the pipe
library(rvest)     # web scraping
library(rebus)     # readable regular expressions
library(robotstxt)
robotstxt::robotstxt("soga.sk") %>% print()
Great! The website allows scraping of the data.
Next, let's identify the elements we want to scrape. SelectorGadget identifies the web elements containing the auction information.
url <- "http://www.soga.sk/aukcie-obrazy-diela-umenie-starozitnosti/aukcie/vysledky-aukcii"
page <- read_html(url) # parse the page once, reuse it for every selector
result <- page %>%
html_nodes("p.results") %>%
html_text()
time <- page %>%
html_nodes("p.about") %>%
html_text()
link <- page %>%
html_nodes("h2 a") %>%
html_attr("href")
df <- cbind(time, result, link) %>% as_tibble()
This chunk of code produces the following table:
df %>% head() %>% knitr::kable()
| time | result | link |
|---|---|---|
| 138. zimná aukcia výtvarných diel a starožitností, 05.12.2017 | Na aukcii bolo vydražených 159 diel (55%) v celkovej hodnote 499 745 €. | /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/138-zimna-aukcia-vytvarnych-diel-a-starozitnosti |
| 137. večerná aukcia, 14.11.2017 | Na aukcii bolo vydražených 45 diel (50%) v celkovej hodnote 378 900 €. | /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/137-vecerna-aukcia |
| 136. jesenná aukcia výtvarných diel, 26.09.2017 | Na aukcii bolo vydražených 152 diel (57%) v celkovej hodnote 198 940 €. | /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/136-jesenna-aukcia-vytvarnych-diel |
| 135. letná aukcia výtvarných diel a starožitností, 13.06.2017 | Na aukcii bolo vydražených 164 diel (50%) v celkovej hodnote 276 625 €. | /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/135-letna-aukcia-vytvarnych-diel-a-starozitnosti |
| 134. večerná aukcia, 25.04.2017 | Na aukcii bolo vydražených 47 diel (51%) v celkovej hodnote 665 600 €. | /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/134-vecerna-aukcia |
| 133. jarná aukcia výtvarných diel, 07.03.2017 | Na aukcii bolo vydražených 129 diel (64%) v celkovej hodnote 212 230 €. | /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/133-jarna-aukcia-vytvarnych-diel |
Perfect! The successful scrape generated a table with information about:

* column time: name of the auction & date of the auction
* column result: # of auctioned items, % of sold items, auction turnover
* column link: the web link to the actual auction
Unfortunately, the data are still locked inside the records (strings). The information is extracted into separate columns using string manipulation (stringr, rebus):
# extract time information (date in the form dd.mm.yyyy; DOT matches a literal dot)
ptrn <- ", " %R% one_or_more(DIGIT) %R% DOT %R% one_or_more(DIGIT) %R% DOT %R% one_or_more(DIGIT)
df$c_time <- df$time %>% str_extract(pattern = ptrn) %>% str_replace(", ","") %>% lubridate::dmy()
# extract name information
df$c_name <- df$time %>% str_replace(ptrn,"")
# extract link information
df$c_link <- paste0("http://www.soga.sk/",df$link)
# extract percentage information
ptrn_2 <- one_or_more(DIGIT) %R% "%" # digits followed by '%' are unambiguous; no need to match the parentheses
df$c_rate <- df$result %>% str_extract(pattern = ptrn_2) %>% str_replace("%","") %>% as.numeric()
# clean items information
ptrn_3 <- one_or_more(DIGIT) %R% " diel"
df$c_items <- df$result %>% str_extract(pattern = ptrn_3) %>% str_replace(" diel","") %>% as.numeric()
# clean price information
ptrn_4 <- one_or_more(DIGIT) %R% SPACE %R% one_or_more(DIGIT) %R% " €"
df$c_price <- df$result %>% str_extract(pattern = ptrn_4) %>%
str_replace_all(" ","") %>% # str_replace_all: drop every thousands separator, not just the first
str_replace("€","") %>%
as.numeric()
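As a quick sanity check, the patterns above can be applied to one sample record from the scraped table:

```r
# 'result' string taken from the first scraped row
s <- "Na aukcii bolo vydražených 159 diel (55%) v celkovej hodnote 499 745 €."
str_extract(s, ptrn_2)  # percentage of sold items, e.g. "55%"
str_extract(s, ptrn_3)  # item count, e.g. "159 diel"
str_extract(s, ptrn_4)  # turnover with thousands separator, e.g. "499 745 €"
```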
# tidy dataframe & calc average price
df_tidy <- df %>%
select(contains("c_")) %>%
mutate(year = lubridate::year(c_time),
quarter = lubridate::quarter(c_time),
qtr = paste0(year,".",quarter),
c_price_avg = c_price / c_items)
df_tidy %>%
select(-year,-qtr,-quarter) %>%
head() %>%
DT::datatable()
The clean dataset reveals important information about the auction house's performance over the years 2001-2017.
df_summary <- function(data, ...) {
group_var <- quos(...)
data %>%
group_by(!!!group_var) %>%
summarise(s_price = sum(c_price),
s_items = sum(c_items),
c_price_avg = mean(c_price_avg))
}
# chart overview
df_tidy %>%
df_summary(year) %>%
rename(average_price = c_price_avg ,
number_of_items = s_items,
auction_turnover = s_price) %>%
gather(Ratio, Value, 2:length(.)) %>%
ggplot(aes(year, Value, group = Ratio)) +
geom_line(alpha = .3) +
theme_minimal(base_family = "Verdana",base_size = 16) +
facet_wrap(~Ratio,scales = "free") +
tidyquant::geom_ma(n = 4,color = "red", linetype = 1, size = 1) +
labs(title = "SOGA: Overview ",
subtitle = "by indicator", caption = "source: www.soga.sk", y = "")
The data show that SOGA's turnover has been growing steadily since 2004. Additionally, the average price per painting sold at auction is rising: it grew from ~1200 € to the current ~5000 € in 14 years. This means either more expensive paintings are being sold or the prices of the artworks are growing. A calculation reveals an 11% CAGR in the price per painting over the period.
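The CAGR figure can be reproduced from the two endpoints quoted above (a back-of-the-envelope sketch; the ~1200 € and ~5000 € values are approximate averages read off the chart):

```r
# compound annual growth rate between the two approximate endpoints
start_price <- 1200  # average price ~14 years ago (approximate)
end_price   <- 5000  # current average price (approximate)
n_years     <- 14
cagr <- (end_price / start_price)^(1 / n_years) - 1
scales::percent(cagr) # ~10.7%, i.e. the roughly 11% quoted above
```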
A byproduct of the scrape is the URL of every single auction in SOGA's history (for example http://www.soga.sk//aukcie-obrazy-diela-umenie-starozitnosti/aukcie/134-vecerna-aukcia). Each of these links leads to every single item auctioned at that particular auction.
The next steps are taken in order to:

* scrape all the pages listing the items sold at one particular auction (a user needs to click "next page" several times to see all the items)
* scrape the URLs of the particular items traded at the auction (download the individual URL of each item)
* scrape the details: starting price, final price, artwork name, etc.
These steps can be executed using following code:
extract_auction_length <- function(url) {
ptrn <- "page=" %R% one_or_more(DIGIT)
num <- url %>% read_html() %>%
html_nodes("#auctionArtworks~ .pager .last") %>%
html_attr("href") %>%
str_extract(ptrn) %>%
str_replace("page=","") %>%
as.numeric()
if (length(num) == 0 || is.na(num)) num <- 1 # auctions with a single page have no pager
paste0(url, "?page=", seq_len(num))
} # build the URLs of all result pages of one auction
extract_page_content <- function(url) {
print(url)
Sys.sleep(sample(seq(1, 3, by=0.001), 1)) # polite random delay between requests
article_list <- url %>% read_html() %>%
html_nodes("h2 a") %>%
html_attr("href")
paste0("http://www.soga.sk", article_list)
} # download the URLs of the items listed on one auction page
extract_article_content <- function(url) {
print(url)
Sys.sleep(sample(seq(1, 3, by=0.001), 1)) # polite random delay between requests
page <- read_html(url) # parse the page once, reuse it for every selector
art_name <- page %>% html_nodes("h2 a") %>% html_text()
art_piece <- page %>% html_nodes(".wrapper a") %>% html_text()
art_cat <- page %>% html_nodes(".col1") %>% html_text() %>% as_tibble()
art_values <- page %>% html_nodes(".col2") %>% html_text() %>% as_tibble()
art_df <- bind_cols(art_cat, art_values)
art_df$name <- art_name
art_df$piece <- art_piece
art_df
} # download the content of one auction item
df_all_pages <- map(df$c_link, extract_auction_length) %>% unlist()
df_all_pages_content <- map(df_all_pages, extract_page_content) %>% unlist()
df_all_article_content_sf <- map(df_all_pages_content, safely(extract_article_content))
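The text below works with a data frame `df_soga`, which is not constructed explicitly in the chunks above. One way to assemble it from the `safely()` output list (a sketch, assuming each successful result is the tibble returned by `extract_article_content`; the exact duplicate column names produced by `bind_cols` may differ across dplyr versions) is:

```r
# keep only the successful scrapes and stack them into one long data frame
df_soga <- df_all_article_content_sf %>%
  map("result") %>%  # take the 'result' element of each safely() pair
  compact() %>%      # drop the NULLs left behind by failed downloads
  bind_rows()
```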
The downloaded raw data have the following (long) structure:
df_soga <- df_soga %>%
rename(variable = value, value = value1) %>%
drop_na()
str(df_soga)
## Classes 'tbl_df', 'tbl' and 'data.frame': 47635 obs. of 4 variables:
## $ variable: chr "Poradové číslo:" "Vyvolávacia cena:" "Konečná cena:" "Poradové číslo:" ...
## $ value : chr "1" "400 €" "320 €" "2" ...
## $ name : chr "CPIN ŠTEFAN (1919 - 1971)" "CPIN ŠTEFAN (1919 - 1971)" "CPIN ŠTEFAN (1919 - 1971)" "HÁLA JAN (1890 - 1959)" ...
## $ piece : chr "Dievča" "Dievča" "Dievča" "Dievča na lúke (Leto)" ...
The variable column contains the following information:
Importantly, everyday items (e.g. sculptures) are sold alongside the paintings. These items will be removed, as they are not the target of this analysis.
To reach a broader audience, I will translate the column names into English.
df_soga$variable <- if_else(df_soga$variable == "Cena v predaji:", "sale_price",
if_else(df_soga$variable == "Konečná cena:", "final_price",
if_else(df_soga$variable == "Nevydražené", "unsold",
if_else(df_soga$variable == "Odhadovaná cena:", "assumed_price",
if_else(df_soga$variable == "Poradové číslo:", "item_number",
if_else(df_soga$variable == "Vyvolávacia cena:", "starting_price",df_soga$variable)))))) # works but looks ugly - sorry folks
The code above is a five-deep nested if_else call - something I am not particularly proud of. There might be a more elegant way to code this, but I haven't found it yet.
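One tidier alternative is a single `dplyr::recode()` call with a named lookup, which avoids the nesting entirely (a sketch, not tested against the full dataset; for character vectors, unmatched values pass through unchanged):

```r
# same translation, expressed as one lookup table
df_soga$variable <- dplyr::recode(df_soga$variable,
  "Cena v predaji:"   = "sale_price",
  "Konečná cena:"     = "final_price",
  "Nevydražené"       = "unsold",
  "Odhadovaná cena:"  = "assumed_price",
  "Poradové číslo:"   = "item_number",
  "Vyvolávacia cena:" = "starting_price")
```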
The next step is to create an ID column that identifies author & artwork, and to spread the data from long to wide - in line with the tidy data manifesto written by Hadley Wickham.
df_soga_wide <- df_soga %>%
mutate(id = paste(name, piece)) %>% # create unique id per row
split(.$id) %>% # split the data per unique artwork
map(safely(spread),variable, value) %>% # spread the values - use safely to prevent errors
map_df("result") # keep only the successful results
glimpse(df_soga_wide)
Now we have one row per piece and variables stored in columns. This enables a quick and consistent way of working with the data. Let's take a glimpse at the data and find the most frequently sold artist at the auction house.
df_soga_wide$name %>%
table() %>%
as_tibble() %>%
set_names(c("artist", "count")) %>%
arrange(desc(count)) %>%
top_n(10, count) %>% # name the weighting column explicitly to silence the "Selecting by" message
DT::datatable()
Great! Now we see that the most frequently sold artist was an unknown Central European artist, with around 360 paintings sold.
Unfortunately, the data is still not tidy. My poorly written web scraping code is responsible for this messy dataset. (I am sure there is better R code to scrape the data more accurately.)
Now, let's tidy up the columns containing the messy data. The columns are still character variables, and the errors preventing transformation into tidy data are:
Let’s remove the errors using string manipulation techniques!
# tidy the columns with variables containing price informations
df_soga_tidy <- df_soga_wide %>%
modify_at(c(5,6), ~ str_replace_all(.,"€","")) %>% # modify_at identifies columns by position, more efficiently than naming them with mutate
modify_at(c(5,6), ~ str_replace_all(.," ","")) %>%
modify_at(c(5,6), ~ str_replace_all(.,"Neurčená","")) %>%
modify_at(c(5,6), ~ str_replace_all(.,",","")) %>%
modify_at(c(5,6), ~ str_replace_all(.,"EUR","")) %>%
modify_at(c(5,6), ~ str_trim(.)) %>%
modify_at(c(5,6), as.numeric)
# tidy the column names
colnames(df_soga_tidy) <- df_soga_tidy %>%
colnames() %>%
str_to_lower() %>%
str_replace_all(":","") %>%
str_replace_all(" ","_")
# tidy the date of birth & death into two separate collumns
living_years <- str_extract_all(df_soga_tidy$name, one_or_more(DIGIT),simplify = T) %>%
as_tibble() %>%
set_names(c("birth_year","death_year","birth_year_I","death_year_I")) %>%
select("birth_year","death_year") %>%
modify_if(is.character, as.numeric)
df_soga_tidy <- df_soga_tidy %>%
bind_cols(living_years)
# remove years from name column
df_soga_tidy$name <- df_soga_tidy$name %>%
str_replace_all(pattern = DIGIT,"") %>%
str_replace_all("\\)"," ") %>%
str_replace_all("\\(","") %>%
str_replace_all("-","") %>%
str_replace_all("–","") %>%
str_trim() %>%
str_to_upper()
# tidy & include assumed_price in the final price
df_assumed_price_tidy <- df_soga_tidy$assumed_price %>%
str_replace_all("-", "/") %>%
str_replace_all("– ","/") %>%
str_replace_all("€","") %>%
str_replace_all("EUR","") %>%
str_replace_all(",00","") %>%
str_replace_all(",","") %>%
str_replace_all(space(),"") %>%
str_split("/",simplify = T) %>%
as_tibble() %>%
modify(as.numeric) %>%
mutate(assumed_price = (V1 + V2) / 2) %>%
select(assumed_price)
df_soga_tidy <- bind_cols(dplyr::select(df_soga_tidy, -assumed_price),
df_assumed_price_tidy)
df_soga_tidy %>% glimpse()
## Observations: 15,978
## Variables: 10
## $ name <chr> "/JIŘÍ/ GEORG DOKOUPIL", "A D.WEGENER ?", "A....
## $ piece <chr> "Erotický motív", "Horská krajina", "Česká kraj...
## $ id <chr> "/JIŘÍ/ GEORG DOKOUPIL (1954) Erotický motív", ...
## $ item_number <chr> "154", "178", "291", "207", "201", "192", "214"...
## $ starting_price <dbl> 465, 929, 597, 996, 1062, 398, 664, 597, 830, 4...
## $ final_price <dbl> NA, NA, NA, 1394, NA, NA, NA, 597, NA, NA, 1461...
## $ sale_price <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ birth_year <dbl> 1954, 1891, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ death_year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ assumed_price <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
Much better! With the code above we have achieved:
Since we have some tidy data now, let’s create some new ratios.
df_soga_tidy <- df_soga_tidy %>%
mutate(starting_price = if_else(is.na(starting_price), assumed_price, as.numeric(starting_price)), # if variable starting_price is NA use assumed_price instead
diff_price_rel = ((final_price - starting_price) / (starting_price)), # relative difference between starting price and final price
diff_price_abs = final_price - starting_price , # absolute difference between starting price and final price
bin_price = ntile(starting_price, 4), # create bins by starting price
bin_diff_price_rel = ntile(diff_price_rel, 4)) # create bins by relative profit & losses
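On a toy row, the new ratios behave as expected (hypothetical prices, just to illustrate the definitions; assumes the tidyverse is loaded):

```r
# hypothetical item: bought in at 1,000 €, hammered down at 1,500 €
toy <- tibble(starting_price = 1000, final_price = 1500) %>%
  mutate(diff_price_rel = (final_price - starting_price) / starting_price, # 0.5, i.e. +50%
         diff_price_abs = final_price - starting_price)                    # 500 €
```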
With tidy data in hand, let's focus on analyzing it. This part is always the cherry on the cake after the painful process of data acquisition & cleaning. Jenny Bryan's quote - "Behind every great plot there's a great deal of wrangling" - is an accurate one. Link here
First, let’s look at the highest prices paid for painting on auction:
df_soga_tidy %>%
drop_na(final_price) %>%
top_n(wt = final_price,n = 10) %>%
ggplot(aes(reorder(piece, final_price),final_price, fill = name)) +
geom_col() +
coord_flip() +
labs(title = "Highest price paid per painting",
subtitle = "sorted by name of the artwork",
caption = "Source: www.soga.sk") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) +
theme(legend.text = element_text(size = 6),
legend.title = element_text(size = 6),legend.position = "bottom")
The highest-priced painting auctioned was "Chlapec s Dalmatinom" by DOMENICHINO (DOMENICO ZAMPIERI). Interestingly, it failed to reach its owner's price of 663 878 € and was sold at a loss of 132 775 € (-19%), for a final price of 531 103 €.
The second chart shows paintings sold at the highest relative profit:
df_soga_tidy %>%
drop_na(diff_price_rel) %>%
top_n(wt = diff_price_rel,n = 15) %>%
ggplot(aes(reorder(piece, diff_price_rel),diff_price_rel)) +
geom_col() +
coord_flip() +
theme_minimal() +
labs(title = "Highest profit generated by painting (in %)",
subtitle = "relative, sorted by name of the artwork",
y = "Profit (in %)",
x = "Artwork",
caption = "Source: www.soga.sk") +
scale_y_continuous(labels = scales::percent)
Surprisingly, the highest profits were in the range of 10x-15x returns. This is a great result for the investor. The most profitable painting was "Zátišie s ľudovým džbánom" by RUDOLF KRIVOŠ (link). This painting was also an outlier, as the rest of KRIVOŠ's paintings generated far lower returns.
Another point of view is to visualize the most profitable paintings in absolute terms.
df_soga_tidy %>%
drop_na(diff_price_abs) %>%
top_n(wt = diff_price_abs,n = 15) %>%
ggplot(aes(reorder(piece, diff_price_abs),diff_price_abs)) +
geom_col() +
coord_flip() +
theme_minimal() +
labs(title = "Highest profit generated by painting (in €)",
subtitle = "absolute, sorted by name of the artwork",
y = "Profit (in €)",
x = "Artwork",
caption = "Source: www.soga.sk") +
scale_y_continuous(labels = scales::comma)
The painting "V žatve" was sold at the highest absolute profit, about 85 000 €. This is equivalent to the price of a small one-bedroom flat in Bratislava.
Now, let’s visualize the result of the SOGA auctions:
df_loss_profit <- df_soga_tidy %>%
drop_na(final_price, bin_price)
library(plotly)
plotly::plot_ly(df_loss_profit,
x = ~starting_price,
y = ~ final_price,
color = ~ bin_diff_price_rel,
text = ~paste("Name: ", name, "\n",
"Artwork:",piece,"\n",
"Starting Price: ", scales::comma(starting_price), "\n",
"Final Price:",scales::comma(final_price),"\n",
"Return (%):",scales::percent(diff_price_rel),"\n",
"Return (€):", scales::comma(diff_price_abs))) %>%
layout(title = "Profits & Losses @ SOGA Auction House",
xaxis = list(title = "Starting Price",range = c(0, 200000)),
yaxis = list(title = "Final Price",range = c(0, 200000)))
Now it is pretty clear where the outliers are: paintings where the difference between starting and final price was big. Some paintings failed to meet expectations and were traded well below the starting price. Another observation: profits get smaller as we move up the price scale.
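The last observation can be quantified with the price bins created earlier (a sketch; `bin_price` splits the items into starting-price quartiles, with 1 being the cheapest):

```r
# median relative return per starting-price quartile
df_loss_profit %>%
  group_by(bin_price) %>%
  summarise(median_return = median(diff_price_rel, na.rm = TRUE),
            n = n()) %>%
  arrange(bin_price)
```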
Finally, let’s make a summary statistics with the most profitable artist sold at the SOGA auction house.
df_soga_tidy %>%
drop_na(final_price,diff_price_rel,diff_price_abs) %>%
group_by(name) %>%
summarise(profit_relative = median(diff_price_rel, na.rm = T),
profit_absolute = median(diff_price_abs, na.rm = T),
artwork_price = median(final_price, na.rm = T),
count = n(),
sum = sum(diff_price_abs)) %>%
filter(count > 5) %>%
arrange(desc(profit_relative)) %>% # sort while the column is still numeric
dplyr::top_n(wt = profit_relative, n = 15) %>%
mutate(profit_relative = scales::percent(profit_relative), # format only after sorting,
profit_absolute = scales::comma(profit_absolute),          # otherwise "8.33%" sorts above "76.5%"
artwork_price = scales::comma(artwork_price),
sum = scales::comma(sum)) %>%
knitr::kable()
| name | profit_relative | profit_absolute | artwork_price | count | sum |
|---|---|---|---|---|---|
| LADISLAV GUDERNA | 87.7% | 2,257 | 4,149 | 7 | 11,284 |
| EDITA SPANNEROVÁ | 86.7% | 249 | 713.5 | 8 | 2,754 |
| BIZMAYER IGNÁC | 85.6% | 325 | 1,325 | 14 | 7,820 |
| ZBYNĚK PROKOP | 76.5% | 50.5 | 116.5 | 6 | 302 |
| JÚLIUS JAKOBY | 74.2% | 3,071 | 7,800.5 | 6 | 16,000 |
| DÁVID BAFFI | 66% | 49.5 | 99.5 | 6 | 397 |
| JOZEF JANKOVIČ | 64% | 531 | 1,759 | 9 | 15,469 |
| FRANTIŠEK REICHENTÁL | 62.5% | 132 | 1,162 | 9 | 11,021 |
| KĽÚČIK PETER | 60% | 90 | 240 | 17 | 6,700 |
| STRUHÁR ALOJZ | 8.33% | 25 | 310 | 6 | 200 |
| MARTIN SEDLÁK | 8.28% | 66 | 1,062 | 7 | 631 |
| FRANTIŠEK LIPTÁK | 7.93% | 100 | 746.5 | 12 | 1,661 |
| VINCENT HLOŽNÍK | 7.49% | 132 | 1,461 | 19 | 14,736 |
| MILOŠ ALEXANDER BAZOVSKÝ | 6.31% | 67 | 3,485 | 21 | 26,457 |
| KERN PETER JÚLIUS | 5.56% | 10 | 470 | 10 | 730 |
The results are filtered to artists who sold more than 5 paintings through SOGA, both to limit the number of records and to ensure some liquidity for the investor. Owning artwork by authors who are rarely traded is a liquidity risk, and is best avoided.